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Abstract —This paper introduces a new probabilistic model 
for online learning which dynamically incorporates information 
from stochastic gradients of an arbitrary loss fnnction. Similar 
to probabilistic Altering, the model maintains a Gaussian belief 
over the optimal weight parameters. Unlike traditional Bayesian 
updates, the model incorporates a small number of gradient eval¬ 
uations at locations chosen using Thompson sampling, making it 
computationally tractable. The belief is then transformed via a 
linear Bow Reid which optimally updates the belief distribution 
using rules derived from information theoretic principles. Several 
versions of the algorithm are shown using different constraints 
on the Row Reid and compared with conventional online learning 
algorithms. Results are given for several classiBcation tasks 
including logistic regression and multilayer neural networks. 

I. Introduction 

An number of problems in artificial intelligence must cope 
with continual learning tasks involving very large datasets or 
even data streams which have become ubiquitous in appli¬ 
cation domains such as life-long learning, computer vision, 
natural language processing, bioinformatics and robotics. As 
these big-data problems demand richer models, novel online 
algorithms are needed that scale efficiently both in accuracy 
and computational resources. Recent work have shown that 
complex models require regularization to avoid local minima 
and overfitting—even when the data is abundant [1]. 

The aim of our work is to formulate an efficient online 
learning approach that leverages the advantages of stochastic 
gradient descent (SGD) [2] and Bayesian filtering [3]. Many 
learning tasks can be cast as optimization problems, and SGD 
is simple, scalable and enjoys strong theoretical guarantees in 
convex problems [4], [5] but tends to overfit if not properly 
regularized. On the other hand, Bayesian filtering methods 
track belief distributions over the optimal parameters to avoid 
overfitting, but are typically computationally prohibitive for 
rich models. To combine these two approaches, we had to 
address two questions. 

The first question is how to update a global belief distribu¬ 
tion over optimal parameters from a local update prescribed 
by SGD. In other words, rather than calculating the posterior 
using a likelihood function, we take gradients as directly 
specifying the velocity field— or flow field —of belief updates. 
As shown in Eigure this can be viewed as tracking an 
ensemble of models under the dynamics induced by the 
expected prediction error. We answer this question by using 





Fig. I. Schematic comparison of learning dynamics in (a) stochastic gradient 
descent, (b) Bayesian filtering, and (c) belief flows. 

the principle of minimum information discrimination (MID) 
[6] to choose the most conservative posterior belief that is 
consistent with the gradient measurement and assumptions on 
the flow field. 

The second question is how to generate predictions without 
integrating over the parameter uncertainty. Calculating the 
optimal belief update would require estimating the expected 
SGD update at every point in parameter space, which is 
prohibitive. To overcome this problem, a natural choice is to 
use Thompson sampling, i.e. sampling parameters from the 
posterior according to the probability of being optimal [7]. As 
is well known in the literature in Bayesian optimization [8], 
optimizing the parameters of an unknown and possibly non- 
convex error function requires dealing with the exploration- 
exploitation trade-off, which is precisely where Thompson 
sampling has been shown to outperform most state-of-the-art 
methods [9]. 

Here, we illustrate this modelling approach by deriving the 
update rules for Gaussian belief distributions over the optimal 
parameter. By assuming that observations generate linear flow 
fields, we furthermore show that the resulting update rules 
have closed-form solutions. Because of this, the resulting 
learning algorithms are online, permitting training examples 
to be discarded once they have been used. 

H. Gaussian Belief Elows 

We focus on prediction tasks with parameterized models, 
each denoted by Fu,{x) with inputs x S and parameters 
w S R"^. At each round n = 1,2,... the algorithm maintains a 
belief distribution Pn{w) over the optimal parameters. Here we 
choose Pn{w) to be represented by a d-dimensional Gaussian 











with mean /i„ and covariance S„: 


Pn{w) = N{w;fj.n, S„) 
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In each round, the algorithm samples a parameter vector 
Wn from the distribution Pn{w). The components of Wn 
are then used as parameters in a classification or regression 
machine for a given input Xn- For example, we can consider 
logistic regression where are the parameters of the logistic 
function. Or we can use deep networks, where Wn specifies 
the synaptic weights of the different layers of the network. 
Without loss of generality, this yields a prediction for the 
particular input Xn- 


Vn — ■ 


A supervised output signal is also provided, and we wish 
to minimize the loss between the true output and predicted 
output: 

Vn)- 


To improve the loss on the current example, SGD then updates 
the parameter as: 

d 

w'n= Wn- r] ■ —t{yn,yn)\w=wr,^ ( 1 ) 

where p > 0 is a learning rate which may vary as the number 
of rounds increase. For the case of a multilayer perceptron, 
this gradient can be efficiently computed by the well-known 
backpropagation algorithm in a single backwards pass. 


A. Full Flow 

Knowing that the sampled parameter vector Wn needs to be 
modified to how do we choose the posterior Pn+i{w)l To 
answer this question, we assume that the update results from 
a linear flow field, i.e. each w is updated as 

w' = Aw + b 

where A G is an affine transformation matrix and b gM.‘^ 
is a translational offset. The main advantage of a linear flow is 
that Gaussian distributions remain Gaussian under such a held 
[10]. In particular, we choose the flow parameters to match the 
SGD update Q: 

w'n = AWn + b. (2) 

Under this linear flow, the new belief distribution is 

Pn+l{w) = N{AfJ,n + b, AYinA^) = N{fJ,n+l, 

where the mean is shifted to y,n+i = Ayn + b and the 
covariance is scaled to S„_|_i = There are many 

potential A and b which satisfy the flow constraint in (|^, so 
we need a way to regularize the ensuing distribution Pn+i{w). 
We utilize the Kullback-Leibler divergence 

ininDKL [Pn+i\\Pn] = min [ P„+i(w) log 
A,b A,b J Pn{w) 

( 3 ) 


subject to the constraint w'n = Awn+b. This is an application 
of the principle of MID [6], an extension of the maximum 
entropy principle [11], and governs the updates of exponential 
family distributions [12]. The next theorem (see Appendix for 
the proof) shows that this problem has a closed-form solution. 

Theorem 1: Let be the eigendecomposi- 

tion of the covariance matrix. Let 


1 

uy = -yf^Un [Wn - yn) 

V 

V\\y + VxD = -J=Un [w'n - yn) 
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be the transformed (whitened) differences between the sam¬ 
pled parameter vector and the mean before and after the update 
respectively, expressed in terms of the 2-D basis spanned by 
the unitary orthogonal vectors y and D. Then, the solution A* 
to ([^ is 


Idxd + Un'/i\^[y D] (A 2 X 2 -F 2 X 2 ) 


V Dn 


u' 


where the 2-D transformation matrix ^ 2 x 2 is given by 



uJv'^.+v\+Si Ji+u^ ( 4 + 1 )]] +v \) 

2(l +«2) ^11 -^2^^ 

uJv'^.+v\+&i J (i+v'^.+v\) 

— - 2 ^+d^) - 


G {—The hyperparameters of the posterior 
distribution are then equal to 


P^n+l — A YjnA yn-\-l — A [yn Wn) “t“ Wn- 

There are actually four discrete solution to ([^, one for each 
combination of (5i and 62- If we assume that the SGD learning 
rate is small, then w'n — yn ~ Wn — yn- In this case, we 
should choose the solution A* « Idxd obtained by selecting 
61 = S2 = 1. 


B. Diagonal and Spherical Flows 

We now consider two special cases: flows with diagonal and 
spherical transformation matrices, i.e. of the form 

A = diag(ai, 02 ,..., aa) and A = all, 

where diag(ti) denotes the square diagonal matrix with the el¬ 
ements of the vector v on the main diagonal and [/ is a unitary 
(rotation) matrix. We match these flow types with multivariate 
Gaussians having diagonal and spherical covariance matrices. 
Let N[y,Tj) be the prior and N[y',H') be the posterior at 
round n, and let subindices denote vector components in the 
following. 

a) Diagonal:: For the diagonal case, the multivariate 
distribution N[y, E) factorizes into d univariate Gaussians 
N[yi, al) that can be updated independently under flow fields 
of the form 

w'^ = UiWi + bi. (4) 

Under these constraints, it can be shown (see Appendix) that 
the optimal transformation is given by 

* _ UjWj -I- S^^y4: + uf[4: -G vf) 

2(1+uf) 
























where Ui = {wi — pL^jui, Vi = (w- — pLi)l<Ji are the i-th 
component of the normalized sampled parameter before and 
after updating and 8 i G {—1,+1}. Similarly to the general 
case, we choose the solution closer to the identity when Ui « 
Vi by picking Si = +1. The posterior hyperparameters are 

a' = a* /r' = a*- Wi) + w[. (5) 

where /r' and a[ are the new mean and standard deviation of 
the i-th component. 

b) Spherical:: For the spherical case, the flow field that 
preserves the spherical distribution is of the form A = all, 
where a > 0 is a scalar and {7 is a unitary matrix such that 
A{w — p) and {w' — p) are colinear; that is, A rotates and 
scales {w — p) to align it to {w' — p). The update is 

w' = aw + b, (6) 

that is, similar to but with an isotropic scaling factor a. 
The optimal scaling factor is then given by 

uv + (5-y/4 + m2(4 + 


where u = (||u; — p\\)/(J, v = (||w' — p\\)/(J, where 
S G {—1,+1} is choosen as <5 = 1 for the near-identity 
transformation. 

C. Non-Expansive Flows 

The previously derived update rules allow flow fields to 
be expansive, producing posterior distributions having larger 
differential entropy than the prior. Such flow fields are needed 
when the error landscape is dynamic, e.g. when the data is 
nonstationary. However, for faster convergence with stationary 
distributions, it is desirable to restrict updates to non-expansive 
flows. Such flow fields are obtained by limiting the singular 
values of the transformation matrix A to values that are smaller 
or equal than one. 

D. Implementation 

The pseudocode of a typical gradient-based online learning 
procedure is listed in Algorithm [T] For a d-dimensional 
multivariate Gaussian with diagonal and spherical covariance 
matrix, the update has time complexity 0 {d), while for an 
unconstrained covariance matrix, this update is 0 {(P) in a 
naive implementation performing a spectral decomposition in 
each iteration. The complexity of the unconstrained covariance 
implementation can be reduced using low-rank techniques as 
described in [13]. Numerically, it is important to maintain the 
positive semidefiniteness of the covariance matrix. One simple 
way to achieve this is by constraining its eigenvalues to be 
larger than a predefined minimum. 

III. Properties 

A. Comparison 

The full optimal transformation will include rotations in the 
2-D subspace spanned by the sampled and learned parameter 
vectors. In contrast, the diagonal transformation will only 
scale and translate each basis direction independently, and 


Algorithm 1 BFLO Pseudo-Code 
Input: hyperparameters 

for n = 1,2,... ,N do 

Get training example (a;„, 2 /„). 

Calculate output: 

Sample S„) 

Set Zn ^ 

Local flow: 

Calculate new weights using e.g. gradient descent, 

w'^^Wn- 'n§^{.yn,Zn)\w„ 

Global flow: 

Calculate flow matrix A* (full, diagonal or spherical). 
Update hyperparameters: 

^n+l G- A*JlnA*^ 

Pn+1 ^ A*{pn - Wn) + w'^ 

(Optional) perform numerical correction. 

end for 

Return (/r at, Sat) 


the spherical transformation acts on the radial parameter only. 
Since rotations allow for more flexible transformations, the 
full flow update will keep the belief distribution relatively 
unchanged, while the diagonal and spherical flows will force 
the belief distribution to shift and compress more. The update 
that is better at converging to an optimal set of weights is 
problem dependent. The three update types are shown in 
Figure [^. 

B. Update Rule 

To strenghten the intuition, we briefly illustrate the non¬ 
trivial effect that the update rule has on the belief distribution. 
Figure shows the values of the posterior hyperparameters 
of a one-dimensional standard Gaussian as a function of A = 
{w — p) and A' = {w' — p), that is, the difference of the 
sampled weight and the center of the prior Gaussian before 
and after the update. 

Roughly, there are three regimes, which depend on the 
displacement A' — A = w' — w. First, when the displacement 
moves towards the prior mean without crossing it, then the 
variance decreases. This occurs when 0 < A'A < A^, that is, 
below the diagonal in the first quadrant and above the diagonal 
in the third quadrant. Second, if the displacement crosses the 
prior mean, then the flow field is mainly explained in terms 
of a linear translation of the mean. This corresponds to the 
second and fourth quadrants. Finally, when the displacement 
moves away from the prior mean, the posterior mean follows 
the flow and the variance increases. This corresponds to the 
regions where A^ < A'A, i.e. above the diagonal in the first 
and below the diagonal in the third quadrant. 

Note that the diagonal A' = A leaves the two hyperparam¬ 
eters unchanged, and that the vertical A = 0 does not lead to a 
change of the variance. Figure ^ illustrates the change of the 
prior into a posterior belief. The left panel shows the update of 
a sampled weight equal to the standard deviation, and the right 
panel show the update for a sampled weight equal to one-fifth 








Fig. 2. a) Comparison of different flow types, b) 1-D Update of the mean fi = 0 and standard deviation ct = 1 as a function of A = {w — fi) and 
A' = (w' — n). b) The two panels illustrate the posterior beliefs (red) resulting from updating a 1-D prior (black) by moving a sampled weight through 
displacements in {—4, —2, -|-2, -1-4}. The prior and posterior positions of the sampled weight are indicated with dashed vertical lines. 
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(subtraction) of datapoints that attract (repel) the mean shown 
in black by relocating the center of mass. In the figure, 
blue and red correspond to added and subtracted datapoints 
respectively, and the circular areas indicate their precision, 
i.e. pn = 1/A„, where A„ S K is the unique eigenvalue of 
R at round n. The convergence of the belief distribution to a 
particular point 2 S can thus be analyzed in terms of the 
convergence of the weighted average of the pseudo datapoints 
to z and accumulation of precision, that is —>■ -boo. 


Fig. 3. Evolution of Pseudo Dataset 


of the standard deviation. Here, it is seen that if the sampled 
weight is closer to the mean, then the update is reflected in a 
mean shift with less change in the variance. 


C. Pseudo Datasets 

The belief updates can be related to Bayesian updates 
through (pseudo) datapoints that would yield the same pos¬ 
terior. Since a belief flow update ensures that the posterior 
stays within the Gaussian family, it is natural to relate it 
to Bayesian estimation of an unknown mean (but known 
covariance) under the self-conjugate Gaussian family. More 
precisely, let P{w) = N{w,p,,Yi) and P'{w) = 7V(w;/j,', E') 
denote the prior and the posterior of a belief flow update. Then, 
it is easily verified that this is equivalent to conditioning a 
prior P{w) = N{w; p, E) on a point u with likelihood 
function P{x\w) = N{x;w, R), where 

x= (E'-i-E-i)-i(E'-V'-S”V), (7) 

i?= (E'-i-E-i)-i. (8) 


The matrix R is symmetric but not necessarily positive 
semidefinite unless we use non-expansive flows. A negative 
eigenvalue A of i? then indicates an increase of the variance 
along the direction of its eigenvector v\ G From a 
Bayesian point of view, this implies that the pseudo datapoint 
was removed (or forgotten), rather than added, along the 
direction of «>. We have already encountered this case in the 
1-D case in Section III-B when the posterior variance increases 
due to a displacement that points away from the mean. 

Figure]^ shows the pseudo dataset for a sequence of updates 
of a spherical belief flow. The temporal dynamics of the 
belief distribution can be thought of as driven by the addition 


IV. Empirical Evaluation 

We evaluated the diagonal variant of Gaussian belief flows 
(BELO) on a variety of classification tasks by training logis¬ 
tic regressors and multilayer neural networks. These results 
were compared to several baseline methods, most importantly 
stochastic gradient descent (SGD). Our focus in these experi¬ 
ment was not to show better performance to existing learning 
approaches, but rather to illustrate the effect of different regu¬ 
larization schemes over SGD. Specifically, we were interested 
in the transient and steady-state regimes of the classifiers to 
measure the online and generalization properties respectively. 

A. Logistic Regression 

Eor this model, we compared Gaussian belief flows (BELO) 
on several binary classification datasets and compared its per¬ 
formance to three learning algorithms: AROW [14], stochastic 
gradient descent (SGD) and Bayesian Langevin dynamics 
(BLANG) [1]. With the exception of AROW, which combines 
large margin training and confidence weighting, these are 
all gradient-based learning algorithms. The algorithms were 
used to train a logistic regressor described as follows. The 
probability of the output y G {0,1} given the corresponding 
input X G is modelled as 

P(y = l|x) = a(w'^x), (9) 

where w G is a parameter vector and a(t) = 

is the logistic sigmoid. When used in combination with the 

binary KL-divergenc^ 

£(z,y) = y log-+ (1 - y) log 4 (10) 

z (1 -z) 

^Equivalently, one can use the binary cross-entropy defined as i(y^z) = 
-y\ogz- (1 - y) log(l - z). 



















































as the error function, the error gradients become: 


TABLE II 

MNIST Classification Results 


^ = ( 2 ;- 2 /) 2 ;- ( 11 ) 

We measured the online and generalization performance in 
terms of the number of mistakes made in a single pass through 
80% of the data and the average classification error without 
updating on the remaining 20% respectively. Additionally, to 
test robustness to noise, we repeated the experiments, but 
inverting the labels on 20% of the training examples (the 
evaluation was still done against the true labels). 

To test the performance in the online setting, we selected 
well-known binary classification datasets having a large num¬ 
ber of instances, summarized as follows. MUSHROOM: 
Physical characteristics of mushrooms, to be classified into 
edible or poisonous. This UCI dataset contains 8124 instances 
with 22 categorical attributes each that have been expanded 
to a total of 112 binary feature^ COVTYPE: 581,012 forest 
instances described by 54 cartographic variables that have to 
be classified into 2 groups of forest cover types [15]. IJCNN: 
This is the first task of the IJCNN 2001 Challenge [16]. We 
took the winner’s preprocessed dataset [17], and balanced 
the classes by using only a subset of 27,130 instances of 
22 features. EEG: This nonstationary time series contains a 
recording of 14,980 samples of 14 EEG channels. The task 
is to discriminate between the eye-open and eye-closed stat^ 
A9A: This dataset, derived from UCI Adult [18], consists of 
32,561 datapoints with 123 features from census data where 
the aim is to predict whether the income exceeds a given 
threshold. 

We used grid search to choose simple experimental pa¬ 
rameters that gave good results for SGD, and used them 
on all datasets and gradient-based algorithms to isolate the 
effect of regularization. In particular, we avoided using popular 
tricks such as “heavy ball”/momentum [19], minibatches, and 
scheduling of the learning rate. SGD and BLANG were 
initialized with parameters drawn from A^(0,tT^) and BELO 
with a prior equal to W(0, cr^), where a — 0.2. The learning 
rate was kept fixed at p = 0.001. Eor the AROW parameter 
we chose r = 10. With the exception of EEG, all datasets 
were shuffled at the beginning of a run. 

Table |I] summarizes our experimental results averaged over 
10 runs. The online classification error is given within a 
standard error cjen < 0.01. The last column lists the mean 
rank (out of 4, over all datasets), with 1 indicating an algorithm 
attaining the best classification performance. BELO falls short 
in convergence speed due to its exploratory behavior in the 
beginning, but then outperforms the other classifiers in its 
generalization ability. Compared to the other methods, it 
comes closest to the performance of BLANG, both in the 
transient and in the steady-state regime, and in the remarkable 
robustness to noise. This may be because both methods use 
Monte Carlo samples of the posterior to generate predictions. 

^http://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/binary.html 

^https://archive.ics.uci.edu/ml/datasets/EEG+Eye-l-State 


Online Classification Error in % 


PLAIN 

RANDOM 

IMAGES 

Rank 

max{(7err } 

0.07 

0.96 

1.16 


SGD 

11.25 

89.14 

72.41 

3.0 

DROPOUT 

9.84 

52.87 

50.68 

1.6 

BFLO 

11.01 

37.94 

47.71 

1.3 

Final Classification Error in % 


PLAIN 

RANDOM 

IMAGES 

Rank 

max{(7err } 

0.44 

3.33 

6.05 


SGD 

7.01 

89.17 

65.17 

3.0 

DROPOUT 

5.52 

53.42 

46.67 

2.0 

BFLO 

5.00 

29.11 

41.55 

1.0 


B. Feedforward Neural Networks 

This experiment investigates the application of Belief Elows 
to learn the parameters of a more complex learning machine— 
in this case, a feedforward neural network with one hidden 
layer. We compared the online performance of Gaussian belief 
flows (BELO) to two learning methods: plain SGD and SGD 
with dropout [20]. As before, our aim was to isolate the effect 
of the regularization methods on the online and test perfor¬ 
mance by choosing a simple experimental setup with shared 
global parameters, avoiding architecture-specific optimization 
tricks. 

We tested the learning algorithms on the well-known 
MNIST|^ handwritten digit recognition task (abbreviated here 
as BASIC), plus two variations derived in [21]: RANDOM: 
MNIST digits with a random background, where each random 
pixel value was drawn uniformly; IMAGES: a patch from a 
black and white image was used as the background for the digit 
image. All datasets contained 62,000 grayscale, 28 x 28-pixel 
images (totalling 784 features). We split each dataset into an 
online training set containing 80% and a test set having 20% 
of the examples. 

To attain converging learning curves in the (single-pass) 
online setting, we have chosen a modest architecture of 784 
inputs, 200 hidden units and 10 outputs with aggressive 
updates. All units have a logistic sigmoid activation function, 
and error gradients were evaluated on the binary Kullback- 
Leibler divergence function averaged over the outputs. At the 
beginning of each run, the examples were shuffled and the 
training initialized with weights either drawn independently 
from a normal W(0, tr^) (SGD and dropout) or by setting 
the prior to W(0, cr^) (BELO), where cr = 0.1. Throughout 
the online learning phase, the learning rate was kept fixed at 
T] = 0.2 and we applied 5 update iterations on each example 
before discarding it. 

We report the results in Table |II] which were averaged 
over 5 runs. As the level of noise increased, all the classifiers 
declined in performance, with PLAIN being the easiest and 
RANDOM being the hardest to learn. SGD had a particularly 
poor behavior relative to the regularized learners with their 

“^httpi/Zy ann.lecun.com/exdb/mnist/index.html 









TABLE I 


Binary Classification Results for Noise Levels 0%/20% 


Online Classification Error in % 


MUSHR. 

COVTYPE 

IJCNN 

EEG 

A9A 

Rank 

AROW 

SGD 

BLANG 

BFLO 

5.32/11.05 

11.86/11.05 

14.44/14.35 

14.30/15.34 

22.58/23.10 

28.03/28.18 

29.30/29.44 

28.14/28.39 

8.44/17.38 

9.01/9.87 

12.86/12.06 

10.34/11.52 

43.59/45.11 

43.39/44.22 

43.71/44.20 

44.07/44.37 

17.79/20.95 

18.62/18.48 

20.51/20.20 

19.03/19.04 

1.2/2.8 

1.8/1-6 

3.2/2.8 

3.8/2.8 

Final Classification Error in % 

max{(7err } 

MUSHR. 

0.23/0.32 

COVTYPE 

0.12/0.40 

IJCNN 

0.26/1.23 

EEG 

0.69/1.26 

A9A 

0.08/0.12 

Rank 

AROW 

SGD 

BLANG 

BFLO 

9.59/13.77 

5.35/10.78 

1.16/0.34 

1.79/0.65 

37.18/38.11 

37.45/38.36 

38.39/39.09 

37.03/37.69 

20.10/20.28 

19.10/26.52 

15.97/14.60 

16.92/16.67 

65.38/63.09 

60.57/61.24 

64.85/66.68 

62.76/62.35 

15.85/17.56 

17.45/19.32 

17.68/19.58 

17.00/16.69 

3.0/2.8 

2.6/2.8 

2.6/2.8 

1.8/1.6 


built-in mechanisms to avoid local minima. BFLO attained 
the lowest error rates both online and in generalization. 
Interestingly, it copes better with the RANDOM than with 
the IMAGES dataset. This could be because Monte Carlo¬ 
sampling smoothens the gradients within the neighborhood. 

V. Discussion and Future Work 

Our experiments indicate that Belief Flows methods are 
suited for difficult online learning problems where robustness 
is a concern. Gaussian belief flows may be related to ensemble 
learning methods in conjunction with locally quadratic cost 
functions. A continous ensemble of predictors that is trained 
under gradient descent follows a linear velocity field when 
the objective function is a quadratic form. Under such flow 
fields, the Gaussian family is a natural choice for modelling 
the ensemble density since it is invariant to linear velocity 
fields. An iteration of the update rule then infers the dynamics 
of the whole ensemble from a single sample by conservatively 
estimating the ensemble motion in terms of the Kullback- 
Feibler divergence. 

Following the Bayesian rationale, the resulting predictor 
is an ensemble rather than a single member. Any strategy 
deciding which one of them to use in a given round must 
deal with the exploration-exploitation dilemma; that is, striking 
a compromise between minimizing the prediction error and 
trying out new predictions to further increase knowledge [22]. 
This is necessary to avoid local minima-here we sample a 
predictor according to the probability of it being the optimal 
one, a strategy known as Thompson sampling [7], [23], [24]. 
The maintainence of a dynamic belief in this manner allows 
the method to outperform algorithms that maintain only a 
single estimate in complex learning scenarios. 

The basic scheme presented here can be extended in many 
ways. Some possibilities include the application of Gaus¬ 
sian belief flows in conjunction with kernel functions and 
gradient acceleration techniques, and in closed-loop setups 
such as in active learning. Furthermore, linear flow fields can 
be generalized by using more complex probabilistic models 
suitable for other classes of flow fields following the same 
information-theoretic framework outlined in our work. Finally, 


future theoretical work includes analyzing regret bounds, and a 
more in-depth investigation of the relation between stochastic 
approximation, reinforcement learning and Bayesian methods 
that are synthesized in a belief flow model. 

Appendix I 
Proof of Theorem 1 

Proof: The KF divergence is given by: 

Dkl(P„+i||P„)= f Pu+i{w) log dw. 

J Pn{w) 

Then for Gaussians in this divergence is: 

l^ipn+l ~ Pn) P'n ipn+l ~ Pn) + P‘n+1 

- ^logdetE;[^S„+i - 
The constraint on the flow implies: 

b = w'„ — AWn 

Pn-\-l Pn — pn) A{Wn p7i) — A^ji 

where A„ = Wn — Pn and = w'^ — fin- So the optimization 
can be written in terms of the matrix A\ 

nnni [A^ - AA„]^ [A'„ - AA„] 

+ iTrS“ME„A^ - i logdet 

We first note that minimizing the KF-divergence implies that 
the transformation A is full-rank, since collapsing the rank of 
the covariance matrix will lead to infinite divergence. Taking 
the derivative with respect to A yields 

-E-i [A; - AA„] Al - E-iAE„ - A"^ = 0, 

which can be rewritten as 

S„ = AS„A^ - (A:, - AA„)(AA„)^ 

= A (S„ + A„A^) A^ - KiAA„f. 



We first see that if = A„, then A = I the solution where 
the flow field is invariant. The expression can be simplified if 
we consider the diagonalization of the covariance matrix: 


In that case, we transform the variables: 


A, 


1 


1 






\/Dn 


U^AUnV^n- 


So, in terms of these transformed variables, the optimal 
condition becomes: 


J = i(J + A„A^)i^-A;(iA„f. 


( 12 ) 


The general solution can be found by considering the 2-D 
basis spanned by the vectors A„ and A^, with unit vectors /t 
and V. 


An = Ufl 


An = vi\fl + v±i>. (13) 


The optimal matrix A is just identity in the other D — 2 
directions orthogonal to this 2-D subspace. Then the optimality 
condition in ([T2li can be written in terms of the 2x2 matrix 


^2x2 — 


Ant, 


A 

A 




(14) 


as 


Putting these together, we get that the optimal transformation 
matrix is: 



uJv'^^+v\±Ji+u'^{A+v'^^+v'\) 

2( 1+m^) ^11 

u^v'^^+v\±^i+u^(A+v'^^+v\) 

2 ( 1 + 1 ?) 


±U|| 


We see that there are actually four discrete solutions for 
^ 2 x 2 - First, one of two roots (one positive, one negative) for 
a can be chosen, and then the signs for Af,n and Am, can be 
swapped. However, if we consider the situation where A^ is 
not far from A„, then we should choose the solution connected 
to the identity matrix I. This means selecting the positive root 
for a, and ensuring the diagonal terms of 4l2x2 are positive. 
The full matrix solution A* can be expressed in terms of this 
^ 2 x 2 as: 


where 


A* =lD,<D+Un^M^U^ (16) 

V 


M — <[ fl ] (^2x2 — ^ 2 x 2 ) 


The parameters for the new belief distribution in the next 
round are then given in terms of A*: 

En+l = A*T,nA*^ Hn+1 = A* {Hn - Wn) + w'n- (17) 


This concludes our proof. 


^2x2 


1 + m2 

0 ■ 


V\\U 

0 ■ 

0 

1 

^2x2 

v±u 

0 


I. (15) 


To solve this quadratic matrix equation, we hrst note that 
symmetry of the matrices implies that the solution must 
satisfy: A^||(AA„). This means that the solution must be of 
the restricted form: 


4l2x2 — 


av\\ 

av± 


Then in terms of the unknowns 
the following three equations: 


At,n ' 

Am, 

a, Af,n, and A^n, we have 


1 = ujj [(1 + - ua] + 

1 = Vj_ [^(1 T U )g: — tXCtj Ann 
0 = U||t;+ [(1 + u‘^)a^ - ua] + A^^nAnn- 


Fortunately, we can determine the solution analytically in 
closed form: 


Ann = T 




(1 + u'^)a^ — ua = 


Ann = ± 
1 


^11 


vf^+vl' 


The quadratic formula then yields: 


u ± 



4(l+ll2) 


2(1 -t- m2) 


A. Special Cases 

The optimal solution for the diagonal flow is obtained as a 
special case of •EH) where A(j || A„ and m+ = 0. Let af be 
the variance of along the i-th component. If we rescale 
the parameter components according to ct? as 

Wi - Pi w'i- Pi 

Ui = - Vi = -, 

O'i 

then the optimal scaling parameter is given by 

_ UiVi ± -v/4TMf(4T^ 

2(1 +m2) 

For the spherical case, we begin by noting that the covariance 
and the transformation matrices are all isotropic, i.e. of the 
form M = mU, where to is a scalar and U is unitary. 
Because of this, the multivariate problem effectively reduces 
to a univariate problem where A = all hrst rotates {w — p) 
to align it to {w' — p) and then scales and translates the 
distribution along this axis. 
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